Poster: Nemo_bis Date: Mar 11, 2014 4:15am
Forum: web Subject: Google's robots.txt rules interpreted too strictly by Wayback machine

https://web.archive.org/web/*/https://groups.google.com/a/googleproductforums.com/forum/#!forum/books says "Page cannot be crawled or displayed due to robots.txt".
However, their robots.txt contains
Allow: /a/
and says
# robots.txt for Google Groups. See this URL for documentation on robots.txt:
# https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
# Note in particular that "the most specific rule based on the length of the
# [path] entry will trump the less specific (shorter) rule."
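
To illustrate the precedence rule quoted above, here is a minimal sketch of longest-match evaluation. It is not Heritrix's or the Wayback Machine's actual code. The `Disallow: /` line in the usage example is a hypothetical rule added only for illustration (the post quotes only `Allow: /a/`), the function name `is_allowed` is made up, wildcards are not handled, and letting Allow win a tie between equal-length paths is an assumption based on my reading of Google's documentation.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under longest-match precedence.

    `rules` is a list of (directive, path_prefix) pairs for one user-agent
    group, e.g. ("Allow", "/a/"). Plain prefix matching only; no wildcards.
    """
    best_len = -1
    allowed = True  # a path matched by no rule is allowed by default
    for directive, prefix in rules:
        # An empty path matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer (more specific) path wins; on a tie, Allow wins
        # (assumption, see the note above).
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical rule set: only "Allow: /a/" comes from the post.
rules = [("Disallow", "/"), ("Allow", "/a/")]
print(is_allowed("/a/googleproductforums.com/forum/", rules))
print(is_allowed("/forum/", rules))
```

Under these assumed rules, the `/a/...` path is allowed because `/a/` is longer than `/`, while `/forum/` is only matched by the shorter Disallow rule and stays blocked.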

Edit: There are a few bug reports against Heritrix about this sort of rule, and most of them are closed; I don't know where the problem lies.
https://webarchive.jira.com/browse/HER-1880
https://webarchive.jira.com/browse/HER-1 / https://webarchive.jira.com/browse/HER-377
https://webarchive.jira.com/browse/HER-1620

This post was modified by Nemo_bis on 2013-11-11 07:28:37

This post was modified by Nemo_bis on 2014-03-11 11:15:46